A growing body of research applies machine learning models to predict the behavior of different real-world problems. Water level prediction with basic models tends to be erratic and to show flat or low accuracy, because the required dataset is often inadequate. In this paper, a powerful scaling strategy is proposed for an improved back-propagation algorithm that uses parallel computing on a graphical processing unit (GPU) for groundwater level prediction in the Faridabad region, Haryana, India. This paper aims to propose a new streamlined form of the back-propagation algorithm for heterogeneous computing and to examine the combination of an artificial neural network (ANN) with a GPU for predicting the groundwater level. A twenty-year data set from 2001 to 2020 has been considered, with three input parameters, namely temperature, rainfall, and water level, for predicting the groundwater level using a parallelized back-propagation algorithm on compute unified device architecture (CUDA). The results show the back-propagation algorithm to be well suited to this task, providing more accurate and faster water level predictions on GPUs than sequential execution on central processing units (CPUs) alone.

1. INTRODUCTION

Water is an essential resource for the survival of life on the planet. Rising demand for water due to population growth, wasteful usage, and the rapid growth of new commercial industry is steadily lowering the water level. To prevent a shortage of water, it is crucial for hydrological researchers to measure the quantity of water available and to act immediately to avert the forthcoming danger [1]. Owing to advances in artificial neural networks (ANNs), they have become a powerful machine learning approach for modeling water-related activity [2]. Prediction on an arbitrarily large dataset tends to fall short of high precision on a single-core processor, i.e., a central processing unit (CPU); to handle big data sets efficiently, substantial hardware is needed to team up with the CPU. The graphical processing unit (GPU) comprises thousands of cores, and each core acts as a computation unit, which enables the use of a parallel structure and offers very high thread-level parallelism [3]-[5]. The present computing structure of CPUs and GPUs does not by itself deliver an adequate performance improvement in heterogeneous computing. To overcome this issue, a joint approach has been used that combines multi-core CPUs and GPUs [6], [7]. Because of the demand for an accelerated high-performance computing environment, an algorithm is required that decreases the execution time and improves performance [8]. Rather than allocating memory separately on the host and the device and moving data between them, a single pointer can be allocated that is usable by both the CPU and the GPU; this is the concept of unified memory allocation [9]. Recent advancements in unified memory have added many features, such as page fault handling for GPUs, transfer of data on request, extra memory allotment for GPUs, and counters for data access [10]. In the past, two distinct strategies, AutoSwap and SmartPool, have been applied to minimize GPU memory consumption without any human intervention [11]. In previous work, different standard algorithms were tested against a parallel version of ANN on compute unified device architecture (CUDA), and the results favored the parallel implementation on GPUs [12]. Matrix multiplication is the most time-consuming task when training on a large dataset. To minimize computing time and accelerate training, a parallelized matrix multiplication algorithm has been used [13].

Substantial work has been undertaken to exploit GPUs, rather than CPUs, for computationally intensive functions. As GPUs are a powerful approach for solving complex problems, there is a need for hardware acceleration of ANNs to improve training performance. The multicore structure of the GPU helps in attaining an optimized neural network design with increased throughput [14]. An effective GPU-based computation has been carried out to optimize the join-order operation and decrease the execution time of complicated queries [15]. Balancing the work assigned to the CPU and the GPU improves the efficiency of a static system [16]. Although a massive amount of work has been done in the past to improve matrix multiplication speed, the research community continues to focus on implementing new hardware and pushing past the limit [17]. Training of a deep recurrent network (DRN) has been evaluated with half-precision floating point on CUDA [18]. A parallelized version of the back-propagation neural network (BPN) algorithm has been implemented on CUDA for GPU to predict the fluctuation rate of the foreign exchange market and compared with the CPU for overall performance improvement [19]. This research work proposes a new parallel BPN algorithm to predict the groundwater level for the Faridabad zone from 20 years of data on a GPU using the CUDA framework.


2. RESEARCH METHOD 

A multilayer back-propagation network works in two phases: forward and backward. In the forward phase, inputs are propagated from the input layer through the network and the resulting output vector is produced. This actual result is then compared with the target result; if the two differ, an error is generated. In the backward phase, the error generated in the feed-forward phase is used to update the weights until the two outputs match. The machine learning approach assists a variety of engineering fields [20]. In a sequential back-propagation network, weight adaptation was devised for the framework based on a spontaneous deviation of error [21]. BPN algorithms have been applied to many prediction problems and have become a successful tool for engineers [22]. Traditional or sequential BPN algorithms can improve the convergence rate for better training [23].

In a parallelized environment, matrix multiplication is executed on the GPU to accelerate it. A function called a kernel defines the code that runs on the GPU. A kernel is executed by one or more threads, which means that the kernel is launched after being split across different GPU threads [24]. Each thread in the kernel has a unique id called threadId, which also determines the data it processes. Each kernel has one or more blocks, and each block has one or more threads. Before the backward pass, the delta function kernel is launched so that it can be used to update the weights and biases concurrently using the multithreaded environment of the GPU. Figure 1 shows the parallelism in the GPU grid for artificial neural networks, representing the number of blocks in one grid and the number of threads in one block [25].
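
The relationship between blocks, threads, and the data each thread processes can be illustrated with a small CPU-side emulation. This is only a sketch in Python, not CUDA code: the function names are hypothetical, and the index formula is the usual one-dimensional CUDA convention (block index times block size plus thread index), which is assumed here rather than taken from the paper.

```python
def global_thread_ids(blocks_per_grid, threads_per_block):
    """List (block index, thread index, global id) for a 1-D launch."""
    ids = []
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            # Mirrors blockIdx.x * blockDim.x + threadIdx.x in CUDA.
            global_id = block_idx * threads_per_block + thread_idx
            ids.append((block_idx, thread_idx, global_id))
    return ids


def elementwise_add(a, b, blocks_per_grid, threads_per_block):
    """Emulate a kernel in which each thread handles one array element."""
    out = [0.0] * len(a)
    for _, _, gid in global_thread_ids(blocks_per_grid, threads_per_block):
        if gid < len(a):  # guard: the grid may hold more threads than elements
            out[gid] = a[gid] + b[gid]
    return out
```

On a real GPU the loop iterations run concurrently; the emulation only shows which element each thread would own. A launch with 100 blocks of 16 threads, for example, would enumerate 1600 global ids.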


2.1. Tiling technique 

The tiling technique is used to solve the square matrix multiplication problem. In the standard square matrix multiplication algorithm, one thread calculates one element of the resultant matrix, and both square matrices are stored in global memory. In the tiling technique, by contrast, all the threads in a block work together to copy the two tile matrices to be multiplied from global memory to shared memory. The matrices are broken down into tiles, which simplifies the complex matrix multiplication and improves the concurrency rate [26]. Figure 2 portrays an example of tile multiplication.

Here $A_{ix}$ and $B_{xj}$ are the two given matrices, $C_{ij}$ is the product matrix, and w is the width of one tile. As every cell in matrix C is independent of the others, the cell values can be calculated in parallel. While multiplying the two tile matrices $A_{ix}$ and $B_{xj}$, a __syncthreads() call is required to synchronize the threads of a block, so that every thread has finished loading its part of the tiles into shared memory before the partial products are computed. As the overall result relies on computation done in parallel, the cooperating threads must be kept in step. A systematic approach has been applied to adjust the tile size for matrix multiplication on different kernels, i.e., sparse matrix-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM) [27].


Figure 2. Tile multiplication [28]
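
The tile-by-tile accumulation described above can be sketched on the CPU as follows. This is a hedged illustration in plain Python rather than the CUDA kernel used in the paper: the function name `tiled_matmul` is hypothetical, the local lists `tile_a` and `tile_b` merely stand in for shared memory, and the loops that a GPU would run in parallel are executed one after another.

```python
def tiled_matmul(a, b, w):
    """Multiply square matrices a and b (lists of lists) using tiles of width w."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ti in range(0, n, w):          # tile row of C (rows owned by one block)
        for tj in range(0, n, w):      # tile column of C (columns owned by one block)
            for tx in range(0, n, w):  # walk the tiles along the shared dimension
                # Stand-ins for the shared-memory copies of one tile of A and B.
                tile_a = [row[tx:tx + w] for row in a[ti:ti + w]]
                tile_b = [row[tj:tj + w] for row in b[tx:tx + w]]
                # On a GPU a barrier would sit here, after the tiles are loaded.
                for i, row_a in enumerate(tile_a):
                    for j in range(len(tile_b[0])):
                        # Add this tile pair's partial product to C[ti+i][tj+j].
                        c[ti + i][tj + j] += sum(
                            row_a[k] * tile_b[k][j] for k in range(len(tile_b))
                        )
    return c
```

Each element of C is built up from one partial sum per tile pair, which is why the tiles must be completely loaded before the inner products start.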


2.2. Unified memory prefetching 

Another technique used to overcome the overhead of transferring data from host to device is unified memory prefetching, in which the data is fetched before the kernel is launched. cudaMemPrefetchAsync() is the function used to prefetch data from unified memory. When evaluating unified memory efficiency, a representative set of test cases with comparable utilization is required [29]. The function cudaMalloc() performs standard memory allocation on the GPU and returns a pointer to the start of the allocated GPU memory. In unified memory allocation, by contrast, a function called cudaMallocManaged() is used, which returns a pointer that can be accessed by both host and device.


2.3. Coalescing technique 

When computing in parallel, different threads of the same block access the dynamic random access memory (DRAM) at the same time; the coalescing technique combines these accesses to achieve the highest memory bandwidth [30]. In the coalescing technique, the elements of the matrix are accessed either row-wise or column-wise, i.e., row after row or column after column. The column-wise method is the best-suited format for the GPU and provides a maximum usage ratio of 100%: when a column is examined, all of its values match the access pattern of coalesced memory. The coalescing technique is shown in Figure 3. The address space is divided into small burst segments. When a load instruction executes, all threads of a warp take part; if all the thread accesses lie in the same burst segment, the memory access coalesces, as it requires only one DRAM request, whereas in an uncoalesced access the locations accessed by the threads lie in different burst sections. In this work, the coalescing technique is used in addition to the tiling technique on shared memory.


[Figure 3 panels: coalescing first load and coalescing second load, each issued by threads T1 to T4]


Figure 3. Coalescing technique [31]
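
The burst-segment argument can be made concrete by counting how many segments one group of threads touches. The sketch below is a simplified model in Python, not a description of real DRAM hardware: the helper names are hypothetical, addresses are counted in elements, and the burst size of four elements is an assumption chosen to match the four threads T1 to T4 in Figure 3.

```python
def burst_segments_touched(addresses, burst_size):
    """Count the distinct burst segments hit by one group of thread accesses."""
    return len({addr // burst_size for addr in addresses})


def warp_addresses(n_threads, stride, base=0):
    """Element address read by each thread when neighbours are `stride` apart."""
    return [base + t * stride for t in range(n_threads)]


# Neighbouring threads read neighbouring elements: one segment serves all of them.
coalesced = burst_segments_touched(warp_addresses(4, 1), burst_size=4)
# Neighbouring threads read elements four apart: every access lands in its own segment.
strided = burst_segments_touched(warp_addresses(4, 4), burst_size=4)
```

In this model the unit-stride pattern needs a single segment, while the strided pattern needs one segment per thread, which is the difference between a coalesced and an uncoalesced access.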


3. THE PROPOSED ALGORITHM 

Figure 4 shows the flow chart of the proposed parallelized back-propagation algorithm. Because the BPN algorithm is sequential in nature, the whole algorithm needs to be parallelized. A parallelized BPN algorithm was implemented in this work to produce the groundwater level prediction. The input variables are the significant parameters that govern the predicted output; temperature, rainfall, and groundwater level have been used for the input layer. Network training is deployed on one hidden layer. The depth of groundwater has been taken as the output, and all the parameters are normalized to the range 0.1-0.9. The activation function used is the sigmoid function, as its output ranges between 0 and 1, which is especially helpful for prediction.


Figure 4. Flow chart for parallelized BPN algorithm
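
The preprocessing and activation described above can be sketched in a few lines. The paper states the target range 0.1-0.9 but not the scaling formula, so the min-max scaling used here is an assumption, and the function names are hypothetical.

```python
import math


def normalize(values, low=0.1, high=0.9):
    """Min-max scale values into [low, high] (assumed scheme, see lead-in)."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # All readings identical: place them at the middle of the range.
        return [(low + high) / 2 for _ in values]
    return [low + (high - low) * (v - v_min) / span for v in values]


def sigmoid(x):
    """Logistic activation; its output always lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))
```

Keeping the targets inside 0.1-0.9 leaves them away from the asymptotes of the sigmoid, so the output node does not have to saturate to reach them.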
Xi - Input matrix
Wi,j - Connection weights between layers
Tj - Target output (future groundwater level)
Oj - Actual output
Ej - Calculated error
r - Learning rate
Maximum number of epochs (iterations) - 100
θj - Threshold


Algorithm: proposed algorithm for parallel backward propagation on GPU

Initialize all weights and biases, typically between 0 and 1

Step 1: for i = 1 to number of iterations do { // repeat for every iteration
    for j = 1 to number of patterns { // for every pattern in the training set
    for each input layer node j { Oj = Netj; }

Step 2: for each hidden/output layer node j {
    cudaMallocManaged(&X, N*sizeof(float));
    cudaMallocManaged(&W, N*sizeof(float));

Step 3: initialize data on CPU for input pattern and weights using function
    cudaMemAdvise(X, count, advice, CPUdeviceId);
    cudaMemAdvise(W, count, advice, CPUdeviceId);

Step 4: unified memory prefetching for forward pass from host to GPU using function
    cudaMemPrefetchAsync(X, N*sizeof(float), device, NULL);
    cudaMemPrefetchAsync(W, N*sizeof(float), device, NULL);

Step 5: define grid and blocks before calling a kernel
    NetSumj = MatrixMultKernel<<<blocksPerGrid, threadsPerBlock>>>(Oj, Xi);
    // 16 threads per block and 100 blocks per grid have been used

Step 6: calculate the weighted sum of the inputs to the node by launching MatrixMultKernel()
    to multiply the two matrices using the tiling technique with coalescing on shared memory;

Step 7: add the threshold to the sum and calculate the activation for the node
    Netj = NetSumj + θj; Oj = 1/(1 + e^(-Netj)); }

Step 8: propagate the errors backward through the network
    for every node j in the output layer, calculate the error for the output layer
    Ej = Oj(1 - Oj)(Tj - Oj);

Step 9: prefetch memory from GPU to host by using the function
    cudaMemPrefetchAsync(X, N*sizeof(float), CPUdeviceId, NULL);
    cudaMemPrefetchAsync(W, N*sizeof(float), CPUdeviceId, NULL);

Step 10: save results on GPU by using function
    cudaMemAdvise(E, count, advice, GPUdeviceId);

Step 11: repeat Step 2 to Step 7 for the hidden layer

Step 12: update weights and bias
    for each weight Wi,j and bias θj {
    ΔWi,j = r Ej Xi;
    Wi,j = Wi,j + ΔWi,j; Δθj = r Ej;
    θj = θj + Δθj; } }

Step 13: calculate global error E = 1/2 Σ(Σ(Tk - Ok)^2)

Step 14: prefetch memory from GPU to host and save results back on GPU
    cudaMemPrefetchAsync(E, N*sizeof(float), CPUdeviceId, NULL);
    cudaMemPrefetchAsync(W, N*sizeof(float), CPUdeviceId, NULL);
    cudaMemAdvise(E, count, advice, GPUdeviceId);

Step 15: } repeat while ((number of iterations < specified maximum) AND (E > specified error))
End of Algorithm
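
For orientation, the arithmetic of Steps 7, 8, 12, and 13 can be written as a small sequential reference in Python. This is a sketch under stated assumptions, not the paper's GPU implementation: it runs on the CPU without any CUDA calls, the hidden-layer error uses the standard delta rule (the listing only says that the steps are repeated for the hidden layer), and the function name `train`, the four hidden nodes, the learning rate of 0.5, and the tolerance are all hypothetical choices.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(patterns, targets, n_hidden=4, rate=0.5, max_epochs=100, tol=1e-4, seed=0):
    """Train a one-hidden-layer network online; return the global error per epoch."""
    rng = random.Random(seed)
    n_in = len(patterns[0])
    # Weights and thresholds start between 0 and 1, as in the listing.
    w_h = [[rng.random() for _ in range(n_in)] for _ in range(n_hidden)]
    th_h = [rng.random() for _ in range(n_hidden)]
    w_o = [rng.random() for _ in range(n_hidden)]
    th_o = rng.random()
    history = []
    for _ in range(max_epochs):
        global_error = 0.0
        for x, t in zip(patterns, targets):
            # Step 7: Net_j = weighted sum + threshold, O_j = sigmoid(Net_j).
            hid = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + th_h[j])
                   for j in range(n_hidden)]
            out = sigmoid(sum(w * h for w, h in zip(w_o, hid)) + th_o)
            # Step 8: E_j = O_j (1 - O_j) (T_j - O_j) for the output node.
            e_out = out * (1 - out) * (t - out)
            # Hidden errors, back-propagated through the current output weights.
            e_hid = [h * (1 - h) * e_out * w_o[j] for j, h in enumerate(hid)]
            # Step 12: weight += r * E_j * input, threshold += r * E_j.
            for j in range(n_hidden):
                w_o[j] += rate * e_out * hid[j]
                th_h[j] += rate * e_hid[j]
                for i in range(n_in):
                    w_h[j][i] += rate * e_hid[j] * x[i]
            th_o += rate * e_out
            global_error += 0.5 * (t - out) ** 2  # Step 13, summed over patterns
        history.append(global_error)
        if global_error <= tol:  # stop early once the error target is met
            break
    return history
```

In the parallel version, the weighted sums of Step 7 are the part handed to the matrix multiplication kernel; the update rules themselves are unchanged.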


4. RESULTS AND DISCUSSION 

The parallelized back-propagation algorithm has been implemented on CUDA version 10.1 using Google Colab. The data set has been taken from [32] and comprises 120 rows covering 2001-2020, i.e., six readings per year with one month skipped between two readings. Of these, 90 rows were used for training and 30 rows for testing. The prediction has been made for the next seven readings. Google Colab is a data science research tool from Google. It is a free service that offers a hosted Jupyter Notebook environment. Users can access a variety of machine learning libraries as well as hardware accelerators [33]. In this way, Google is lowering the barriers to entry into deep learning. Many researchers who do not have access to a large quantity of GPU resources can benefit from this tool. It allows GPU access for up to 12 hours at a time.
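
The division of the 120 readings into 90 training rows and 30 testing rows can be sketched as a chronological split. The helper name is hypothetical and the list below holds placeholder indices only, not the measurements from [32].

```python
def chronological_split(readings, n_train):
    """Keep time order: the first n_train readings train, the remainder tests."""
    return readings[:n_train], readings[n_train:]


years, readings_per_year = 20, 6               # 2001-2020, six readings per year
readings = list(range(years * readings_per_year))  # placeholder indices only
train_rows, test_rows = chronological_split(readings, 90)
```

Splitting by position rather than at random keeps every test reading later in time than every training reading, which matters for a forecasting task.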

To use a GPU-enabled notebook backend, perform the following steps: go to Google Colab, click on Runtime, select Change runtime type, and set the hardware accelerator to GPU. An NVIDIA Tesla T4 with 2560 CUDA cores and CUDA version 11.2 was used to investigate the results. The NVIDIA system management interface is depicted in Figure 5.

This section presents the outcomes and interprets the resulting graphs for training execution time, accuracy, error, model loss, and prediction on the GPU. The GPU results are clearly distinguishable from the CPU results. Figure 6 shows the plot for the dataset of 120 readings. The X-axis represents the observed months and the Y-axis the groundwater level in meters. The blue line represents the complete input dataset of 120 readings, the orange line shows the model's fit on the first 90 training readings, and the green line represents the model's predictions on the test data for the last 30 readings. Figures 7(a) and 7(b) show the execution time and mean squared error (MSE) with an increasing number of epochs for both CPU and GPU. The parallelized algorithm on the GPU displays better performance, with the lowest error rate and execution time.


[Figure 5 screenshot: nvidia-smi output dated Wed Jun 9 2021, reporting NVIDIA-SMI 465.27 and CUDA Version 11.2 for GPU 0, a Tesla T4 at 55C drawing 29W / 70W, with 104MiB / 15109MiB memory in use and 0% GPU utilization]


Figure 5. NVIDIA system management interface
Figure 7. Comparing results for CPU vs. GPU: (a) execution time comparison and (b) mean squared error comparison


Table 1 lists the errors calculated while predicting the groundwater level, which are used to evaluate the performance of the learning algorithm. The mean absolute error (MAE) and mean squared error (MSE) are used to check the quality of the regression, whereas the root mean square error (RMSE) shows the standard deviation of the prediction errors over the data set records. RMSE is the regular analytical method for evaluating model performance in weather sciences and the prediction of atmospheric conditions, while MAE is well suited to the assessment of different models [34].


Table 1. Computational error
Error type                 Value
Mean absolute error        0.0696268
Mean squared error         0.0051229
Root mean squared error    0.0715743
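
The three error measures in Table 1 follow standard definitions, sketched below in Python. The function names are hypothetical and no data from the paper is used; the only link to Table 1 is that the RMSE is the square root of the MSE, which the tabulated values satisfy to within rounding.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average magnitude of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error: average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))
```

Because the errors are squared before averaging, MSE and RMSE weight large deviations more heavily than MAE does.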


Figure 8 shows the execution time taken by both the CPU and the GPU to predict the groundwater level for the twenty-year dataset. It is clear from the figure that the time taken by the GPU using the parallelized BPN algorithm is less than the time taken by the CPU alone for the same data set. The comparison between CPU and GPU for total execution time, average time per step, and memory used is shown in Table 2.


- 0s 7ms/step - loss: 1.8031e-04 - val_loss:
- 0s 7ms/step - loss: 1.5398e-04 - val_loss:
- 0s 7ms/step - loss: 2.1770e-04 - val_loss:
with CPU time taken in seconds: 10.076101181000013
with GPU time taken in seconds: 0.7781454485542905


Figure 8. CPU vs. GPU execution time


Table 2. CPU vs. GPU
Metric                       CPU        GPU
Total time (s)               10.07610   0.77814
Average time per step (s)    0.014      0.006
Memory used (GB)             0.99       1.54
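
The gain implied by the total times in Table 2 can be derived with two short formulas. The sketch below uses hypothetical function names and takes only the two tabulated totals as input.

```python
def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds


def time_reduction_percent(cpu_seconds, gpu_seconds):
    """Share of the CPU execution time that the GPU run saves, in percent."""
    return 100.0 * (cpu_seconds - gpu_seconds) / cpu_seconds


cpu_s, gpu_s = 10.07610, 0.77814  # total times in seconds from Table 2
factor = speedup(cpu_s, gpu_s)
saved = time_reduction_percent(cpu_s, gpu_s)
```

With these inputs the GPU run comes out roughly thirteen times faster than the CPU run, at the cost of the higher memory use listed in the same table.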


5. CONCLUSION 

Based on the results of this research, it can be concluded that the proposed parallelized back-propagation method on the GPU predicts groundwater levels in the Faridabad region faster than the CPU alone. The CPU execution time for training and testing the network is approximately 10.08 seconds, whereas the GPU execution time is approximately 0.78 seconds, a reduction of approximately 92.3%. It can be inferred that the parallelized implementation on the GPU delivers improved performance compared with the CPU, with a mean absolute error of 0.0696268.


6. FUTURE WORK 

Future work includes extending the parallelized back-propagation algorithm to other real-world problems in order to accelerate ANN research on different hardware. To exploit faster GPUs, the power of other algorithms must likewise be increased through parallelization.